1 Abstract


In the third quarter of the work effort, we focused on the research and design of the randomized
Approximate Nearest Neighbors (ANN) algorithm. This randomized variant of the ANN algorithm has
theoretically proven improvements over existing algorithms in the number of data dimensions that
it can handle, and it meets the theoretical lower bounds for computational complexity. Algorithm
designs for computing randomized ANN using randomized Fast Fourier Transform projections were
completed. A Fortran 95 interface for a reusable randomized ANN routine has been defined and
implemented. The randomized ANN implementation uses BLAS libraries via standardized interfaces
to make optimal use of hardware resources (e.g., multiple cores, CPU cache), and uses the OpenMP
standard for parallel execution of code. Use of these standards enables the code to be built
flexibly in a number of ways on various target platforms. Preliminary testing of the software is
complete. Additional updates and fine-tuning will be based on results from experiments that will
be conducted in the upcoming quarter.

The project is currently on track. In the upcoming quarter, we will focus on testing and
conducting experiments for the randomized SVD and ANN algorithms, including the associated
documentation and packaging efforts. No problems are currently anticipated.


Use  or  disclosure  of  data  contained  on  this  sheet  is  subject  to  restrictions  on  the  title  page  of  this  report. 




Telcordia®


ISRN  TELCORDIA --2011-03+PR-0GARAU 
Technical  Progress  Report 
Table  of  Contents 

1 ABSTRACT
2 SUMMARY
3 INTRODUCTION
4 METHODS, ASSUMPTIONS AND PROCEDURES
    4.1 Randomized Approximate Nearest Neighbor Algorithm
        4.1.1 Randomization Step
        4.1.2 Tree Construction Step
        4.1.3 Supercharging Step
        4.1.4 Query
    4.2 Deliverables / Milestones
5 RESULTS AND DISCUSSION
    5.1 Test Setup
    5.2 Test Results
6 CONCLUSIONS
7 REFERENCES



2  Summary 


The focus for this quarter was on the research and development of the randomized Approximate
Nearest Neighbors (ANN) algorithm using random Fast Fourier Transform projections. A Fortran 95
interface for a reusable randomized ANN routine has been defined and implemented.

The implementation uses the Basic Linear Algebra Subprograms (BLAS) standard in addition to the
OpenMP standard (for parallel execution on multi-core/multiple CPUs). Preliminary tests were
satisfactory, especially those involving real-world data in text retrieval applications. Further
updates and fine-tuning will be based on testing and experiments conducted in the upcoming
quarter.

The project is currently on track. We will focus on testing and conducting experiments for the
randomized SVD and ANN algorithms in the next quarter. No problems are currently anticipated.



3  Introduction 


The primary project effort over the last quarter focused on the design and development of a fast,
scalable Approximate Nearest Neighbor (ANN) algorithm using randomized Fast Fourier Transform
projections (see [1]). The algorithm has been implemented as a reusable routine in Fortran 95
with well-defined interfaces. The implementation uses the standard Basic Linear Algebra
Subprograms (BLAS) [2] interface and the OpenMP API [7]. This provides flexibility in compiling
the software on a variety of target platforms, and lets it exploit optimized BLAS libraries
(see [3][4][5][6]) and multiple cores/CPUs where they are available.

Preliminary tests have been conducted. Additional testing and experimentation will be carried
out in the upcoming quarter for both the randomized SVD and ANN algorithms, which may result in
improvements to and fine-tuning of the software.

4 Methods, Assumptions and Procedures


4.1  Randomized  Approximate  Nearest  Neighbor  Algorithm 

Our  implementation  of  the  randomized  ANN  involves  the  use  of  randomized  projections  using 
the  Fast  Fourier  Transform  (see  [1]  for  details). 

The algorithm may be broadly decomposed into three steps: a) a randomization step, in which the
points in the input data set are subjected to a random projection (to a lower dimensional space);
b) construction of a tree data structure that partitions the points using the medians of the
first few coordinates; and c) refinement of the bins using supercharging for subsequent queries.

Note:  Assume  the  input  data  set  is  of  size  n  and  contains  points  of  dimension  d.  For  all  three 
steps,  BLAS  operations  are  used  wherever  appropriate. 

4.1.1  Randomization  Step 

In this step, we apply a pseudo-random orthogonal linear transformation matrix to each data
point. This step is easily parallelizable since the random projection may be applied to each
point independently. We choose a pseudo-random transformation matrix based on the Fast Fourier
Transform (FFT), which is less expensive to compute than, say, one built by drawing random
numbers from a Gaussian distribution.
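As an illustration only, the idea of an FFT-based pseudo-random orthogonal transformation can be sketched as follows. This is not the construction from [1] and not the Fortran 95 routine; it is a simplified stand-in that combines random sign flips, an orthonormal discrete Hartley transform computed through the FFT, and a random permutation. The function name and parameters are hypothetical.

```python
import numpy as np

def make_random_orthogonal_transform(d, seed=0):
    """Return a function applying a pseudo-random orthogonal map to d-dimensional points.

    Assumed construction (illustrative): x -> P * H * D * x, where D is a diagonal
    matrix of random signs, H is the orthonormal discrete Hartley transform
    (computed via the FFT), and P is a random permutation of the coordinates.
    """
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=d)   # diagonal of D
    perm = rng.permutation(d)                 # permutation P

    def transform(points):
        x = np.atleast_2d(points) * signs              # apply D to every point
        f = np.fft.fft(x, axis=1, norm="ortho")        # unitary DFT of each row
        hartley = f.real - f.imag                      # real, orthogonal Hartley transform
        return hartley[:, perm]                        # apply P

    return transform
```

Because each factor is orthogonal, Euclidean distances between points are unchanged, so nearest neighbors are preserved; each row is transformed independently, which is what makes the step easy to parallelize.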

4.1.2  Tree  Construction  Step 

Here, we construct a tree data structure of depth L. Set the bin at level 0 to all the data
points. For a given level l, use the median of the l-th coordinate to partition each bin at the
previous level into two bins at the current level (based on whether the l-th coordinate of the
selected data point is less-than-or-equal-to or greater-than the median). A label comprising
"L" or "G" characters may then be assigned to each leaf bin, indicating the
less-than/greater-than decision at each level of the tree (e.g., "LLGL").

For each point in a leaf bin (at level L), compute the k nearest neighbors by searching over
that bin and its adjacent bins. Adjacent bins are defined as those whose labels are exactly unit
edit distance away (i.e., the labels differ in exactly one position).
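The tree construction and the adjacent-bin search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's Fortran 95 implementation: the function names are hypothetical, the search includes the point's own bin, and a brute-force distance computation stands in for the BLAS-based kernels.

```python
import numpy as np

def build_leaf_bins(points, depth):
    """Split every bin by the median of one coordinate per level.

    Level l (1-based) uses the l-th coordinate, i.e. column l-1 of `points`.
    Returns a dict mapping leaf labels such as "LLGL" to arrays of point indices.
    """
    bins = {"": np.arange(len(points))}               # level 0: all points in one bin
    for level in range(depth):
        next_bins = {}
        for label, idx in bins.items():
            coord = points[idx, level]
            median = np.median(coord) if len(idx) else 0.0
            next_bins[label + "L"] = idx[coord <= median]   # less-than-or-equal
            next_bins[label + "G"] = idx[coord > median]    # greater-than
        bins = next_bins
    return bins

def adjacent_labels(label):
    """Labels at unit edit distance: flip exactly one L/G decision."""
    flip = {"L": "G", "G": "L"}
    return [label[:i] + flip[c] + label[i + 1:] for i, c in enumerate(label)]

def knn_in_adjacent_bins(points, bins, k):
    """For each point, pick the k nearest candidates from its bin and adjacent bins."""
    neighbors = {}
    for label, idx in bins.items():
        candidates = np.concatenate([idx] + [bins[a] for a in adjacent_labels(label)])
        for p in idx:
            others = candidates[candidates != p]      # exclude the point itself
            dist = np.linalg.norm(points[others] - points[p], axis=1)
            neighbors[int(p)] = others[np.argsort(dist)[:k]]
    return neighbors
```

With depth L there are 2^L leaf bins and each leaf has L adjacent bins, so every point is compared against a small fraction of the data set rather than all n points.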

4.1.3  Supercharging  Step 

For each point, consider the set of all its current k-ANNs and the k-ANNs of each of them. Then,
reconstruct the point's k-ANN using this bigger set.
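A single supercharging pass, as described above, might look like the following sketch. It assumes the current neighbor lists are held in a dictionary from point index to neighbor indices; that layout and the function name are hypothetical choices for illustration.

```python
import numpy as np

def supercharge(points, neighbors, k):
    """One supercharging pass over all points.

    For each point, the candidate set is its current neighbors plus the
    neighbors of those neighbors; the k closest candidates become the new list.
    """
    improved = {}
    for p, nbrs in neighbors.items():
        candidates = {int(q) for q in nbrs}
        for q in nbrs:
            candidates.update(int(r) for r in neighbors[int(q)])
        candidates.discard(p)                          # a point is not its own neighbor
        cand = np.fromiter(candidates, dtype=int)
        dist = np.linalg.norm(points[cand] - points[p], axis=1)
        improved[p] = cand[np.argsort(dist)[:k]]
    return improved
```

Since the old neighbors are always among the candidates, a pass can only keep or improve the distance to the k-th neighbor; it never makes a list worse.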

4.1.4  Query 

The data structure constructed in the previous steps is used to determine the k-ANN for any
given data point.

If the data point is a member of the original data set, then the k-ANN set for that data point
is simply retrieved from the data structure and returned.
If the data point is not a member of the original data set, then the random projection and the
partitioning tree are used to find the appropriate leaf bin for the new data point. A k-NN
search is performed on the smaller set of all the points in the bin as well as all their current
k-ANNs to determine the k-ANN result set.
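The query for a point outside the original data set can be sketched as below. The sketch assumes that the query point has already been passed through the same random projection as the data, and that the split medians were stored per internal tree node when the tree was built; the dictionary layout and function names are hypothetical.

```python
import numpy as np

def find_leaf_label(x, medians, depth):
    """Walk the tree from the root; `medians` maps an internal label to its split value."""
    label = ""
    for level in range(depth):
        label += "L" if x[level] <= medians[label] else "G"
    return label

def query(x, points, bins, medians, neighbors, depth, k):
    """Return indices of the approximate k nearest neighbors of a new point x."""
    label = find_leaf_label(x, medians, depth)
    candidates = set()
    for p in bins[label]:
        candidates.add(int(p))                               # points in the leaf bin
        candidates.update(int(q) for q in neighbors[int(p)]) # and their current k-ANNs
    cand = np.fromiter(candidates, dtype=int)
    dist = np.linalg.norm(points[cand] - x, axis=1)
    return cand[np.argsort(dist)[:k]]
```

For example, with four one-dimensional points 0, 1, 2, 3 split at 1.5, a query at 1.4 lands in the "L" bin and still finds point 2 through the stored neighbor list of point 1.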

In  practice,  multiple  rounds  of  the  algorithm  along  with  one  or  more  supercharging  iterations  are 
found  to  provide  the  best  results. 


4.2  Deliverables  /  Milestones 


Date      Deliverables / Milestones                                                                  Status
Oct 2010  Progress report for period 1, 1st quarter                                                  ✓
Jan 2011  Progress report for period 1, 2nd quarter / complete randomized matrix decompositions task
Apr 2011  Progress report for period 1, 3rd quarter / complete approximate nearest neighbors task
Jul 2011  Progress report for period 1, 4th quarter / complete experiments - part 1
Oct 2011  Progress report for period 2, 1st quarter
Jan 2012  Progress report for period 2, 2nd quarter / complete multiscale SVD task
Apr 2012  Progress report for period 2, 3rd quarter
Jul 2012  Progress report for period 2, 4th quarter / complete experiments - part 2
Oct 2012  Progress report for period 3, 1st quarter
Jan 2013  Progress report for period 3, 2nd quarter / complete multiscale Heat Kernel task
Apr 2013  Progress report for period 3, 3rd quarter
Jul 2013  Final project report + software + documentation on CDROM / complete experiments - part 3

5 Results and Discussion


We  present  some  results  from  our  preliminary  testing  of  the  randomized  Approximate  Nearest 
Neighbors  algorithm  to  highlight  the 

a)  accuracy  of  the  randomized  ANN,  and 

b)  applicability  to  real-world  text  IR  problems. 

5.1  Test  Setup 

A data set of 35,270 points in 400 dimensions was constructed by analyzing text documents from
eight different sources. Each point corresponds to a body of text (i.e., a document). A standard
task in Information Retrieval (IR) is to find similar documents given a query document. This
requires computing the nearest neighbors subject to some threshold on the distance. In this
test, we apply the randomized ANN algorithm to this problem. The number of nearest neighbors k
was set to 30 with a distance threshold of 0.8. Further, each document vector has unit norm.
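Because the document vectors have unit norm, the distance threshold can be read as a similarity threshold. The short derivation below assumes the distance is Euclidean, which the report does not state explicitly; the symbols x and y denote two document vectors.

```latex
\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\,x^{\mathsf T} y = 2 - 2\,x^{\mathsf T} y
\quad\Longrightarrow\quad
\|x - y\| \le 0.8 \;\Longleftrightarrow\; x^{\mathsf T} y \ge 1 - \frac{0.8^2}{2} = 0.68
```

Under that assumption, a neighbor within distance 0.8 is a document whose cosine similarity to the query is at least 0.68.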

All tests were carried out on a machine with an Intel Core Duo E6550 2.33 GHz CPU (2 cores in
total) and 4 GB of main memory. The OS was Ubuntu 10.10 64-bit SMP.

5.2  Test  Results 


Figure 1: Fraction missed for 30-ANN with distance threshold of 0.8
(series: no supercharging, supercharged once, supercharged twice)

Figure 1 shows the fraction missed (the baseline was the exact 30-NN for all the data points) in
computing the 30-ANN using our algorithm for all the points in the data set. It highlights the
improvement obtained using supercharging. The error is about 0.5%, which is within acceptable
limits. Figures 2 and 3 show the corresponding run times. The time for computing the exact 30-NN
for all the data points was recorded to be 4701 seconds.
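One plausible reading of the "fraction missed" metric is sketched below: among the exact nearest neighbors that lie within the distance threshold, count the share that the approximate result fails to return. The report does not give a formal definition, so this reading, the function name and the array layout are assumptions.

```python
import numpy as np

def fraction_missed(exact_idx, exact_dist, approx_idx, threshold):
    """Share of true neighbors within `threshold` that the approximate lists miss.

    exact_idx, approx_idx: (n, k) arrays of neighbor indices per point.
    exact_dist: (n, k) array of distances matching exact_idx.
    """
    missed = 0
    total = 0
    for e_row, d_row, a_row in zip(exact_idx, exact_dist, approx_idx):
        # True neighbors that also satisfy the distance threshold.
        true_set = {int(i) for i, d in zip(e_row, d_row) if d <= threshold}
        found = {int(i) for i in a_row}
        total += len(true_set)
        missed += len(true_set - found)
    return missed / total if total else 0.0
```

With this definition, a point whose exact neighbors all lie beyond the threshold contributes nothing to the count, which matches the IR use case of retrieving only sufficiently similar documents.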
Figure 2: Run times for 30-ANN with distance threshold of 0.8
(plot title: Run Times; axis: rounds; series: no supercharging, supercharged once, supercharged twice)

Figure 3: Run times for 30-ANN with distance threshold of 0.8
(axes: fraction of 30 neighbors within distance 0.8 missed, seconds; series: no supercharging,
supercharged once, supercharged twice, brute force)


Figure 4 shows the effects of multiple rounds on the quality of ANN results with different
distance thresholds (cut-offs). Overall, the algorithm shows stability, with approximately 0.5%
error for thresholds of interest and a performance gain of 90-94% over the exact k-NN
algorithm. As expected, the error grows as the distance threshold is significantly increased.
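As a rough reading of the quoted gain, assume it denotes the relative reduction in run time against the 4701-second exact 30-NN baseline reported in this section. The report does not define the term, so both the definition and the resulting figures below are an interpretation rather than measured values.

```latex
\text{gain} = 1 - \frac{t_{\mathrm{ANN}}}{t_{\mathrm{exact}}}
\quad\Longrightarrow\quad
t_{\mathrm{ANN}} = (1 - \text{gain}) \cdot 4701\ \mathrm{s},
\qquad
(1 - 0.90) \cdot 4701 \approx 470\ \mathrm{s},
\qquad
(1 - 0.94) \cdot 4701 \approx 282\ \mathrm{s}
```

Under that assumption, a 90-94% gain corresponds to approximate run times of roughly 280 to 470 seconds.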


Figure 4: Effect of number of rounds on fraction missed and distance cutoff
(plot title: Approximate Nearest Neighbors -- Fraction of True Neighbors Missed Among 30 Nearest
Neighbors up to Distance Cutoff -- Document Matrix; axis: distance cutoff; tick values 1, 0.1,
0.01, 0.001, 0.0001; series: 1 round through 10 rounds)

6 Conclusions

The project is on track with the completion of the design and implementation of the second
algorithm (Randomized Approximate Nearest Neighbors). We will continue with further testing
and application of the randomized SVD and ANN algorithms to a variety of real-world data sets
in the next quarter.

No  problems  are  currently  anticipated. 